Abstract

We analyze and improve low rank representation (LRR), the state-of-the-art algorithm
for subspace segmentation of data. We prove that for the noiseless case, the optimization
model of LRR has a unique solution, which is the shape interaction matrix (SIM) of the
data matrix. So in essence LRR is equivalent to factorization methods. We also prove that
the minimum value of the optimization model of LRR is equal to the rank of the data
matrix. For the noisy case, we show that LRR can be approximated as a factorization
method combined with noise removal by column sparse robust PCA. We further propose an
improved version of LRR, called Robust Shape Interaction (RSI), which uses the corrected
data as the dictionary instead of the noisy data. RSI is more robust than LRR when the
corruption in data is heavy. Experiments on both synthetic and real data testify to the
improved robustness of RSI.

1 Introduction

In many computer vision and machine learning problems, one often assumes that the data 
is drawn from a union of multiple linear subspaces. Thus subspace segmentation of such 
data has been studied extensively. The existing methods for subspace segmentation can be 
roughly divided into four groups: statistical learning based methods ([1, 2]), factorization
based methods ([3, 4, 5]), algebra based methods [6], and sparsity based methods (e.g., SSC
[7] and LRR [8]).

LRR [8] is a recently proposed method and is reported to have excellent performance on
both synthetic and benchmark data sets. For the noiseless case, LRR takes the data itself as 
a dictionary and seeks the representation matrix with the lowest rank. LRR can also handle 
noisy data by adding a (2,1)-norm term to the objective function in order to make the noise
column sparse. The experimental results show that it is both robust and accurate. 

However, the motivation for using the low rank criterion in LRR remains vague. In [8], the
authors only proved that for the noiseless case, the convex optimization model of LRR admits a
block diagonal solution, which is the representation they seek. As the nuclear norm^1 is not
strongly convex, it is unclear whether such a solution is unique and what it actually is. If
uniqueness is not guaranteed, we risk finding a non-block-diagonal solution, in which case
the clustering information will not be revealed. In the case of noisy data, the authors
simply added a (2,1)-norm term to the objective function without providing sufficient insight
into why such a treatment works well.

Our Contributions 

In this paper, we present a detailed analysis of LRR. We find that although LRR was
categorized by the authors of [8] as a sparsity based method, it is actually closely related to the
factorization methods. Our main contributions are:

1. We prove that in the noiseless case, LRR has a unique solution, which is exactly the 
shape interaction matrix of the data matrix. Consequently, the minimum objective 
function value is the rank of the data matrix. 

2. For the noisy case, we show that LRR can be roughly regarded as first applying a 
column sparse robust PCA [9] to remove noise, and then performing segmentation on 
the corrected data. 

3. We propose a modified model, called Robust Shape Interaction (RSI) due to its
dependence on the shape interaction matrix, as an improvement of LRR. Our experiments
show that RSI is more robust and has better performance than LRR.

^1 The nuclear norm of a matrix is the sum of the singular values of the matrix. It is the convex envelope of
the rank function on the unit matrix 2-norm ball and is often used as an approximation of rank.

The remainder of this paper is organized as follows. Section 2 reviews the related
state-of-the-art methods. Section 3 studies the relationship between LRR and factorization methods.
Then Section 4 introduces RSI as an improvement of LRR. The simulation results on
synthetic and real data are shown in Section 5. Finally, Section 6 concludes the paper.

2 Review of LRR and Factorization Methods 
2.1 Basic Subspace Segmentation Problem 

Let $X = [x_1, x_2, \ldots, x_n]$ be a collection of $m$-dimensional data vectors drawn from a union of
linear subspaces $\{S_i\}_{i=1}^{k}$, where the dimension of $S_i$ is $r_i$. The task of subspace segmentation
(or clustering) is to cluster the vectors in $X$ according to those subspaces.

For notational simplicity, we may assume $X = [X_1, X_2, \ldots, X_k]$, where $X_i$ consists of
the vectors in $S_i$. For LRR and factorization methods, it is assumed that the subspaces are
independent^2.

Denote by $d_i$ the number of vectors in $X_i$. Then there must be at least one block diagonal
matrix $Z = \mathrm{diag}\{Z_1, Z_2, \ldots, Z_k\}$ satisfying

X = XZ, (1)

where the size of the i-th block $Z_i$ is $d_i \times d_i$. Equation (1) actually has infinitely many
solutions. Any solution is called a representation matrix. Note that the block diagonal
structure of $Z$ directly induces a segmentation of the data (each block corresponds to a cluster).
So the clustering task is equivalent to finding a block diagonal representation matrix $Z$.



^2 If for every $a \in \mathrm{span}\{S_1, S_2, \ldots, S_k\}$ the decomposition $a = \sum_{i=1}^{k} a_i$, where $a_i \in S_i$, is unique, then we say
that the subspaces $S_1, S_2, \ldots, S_k$ are independent.



2.2 Subspace Segmentation by Low-rank Representation 

As the solution to (1) is not unique, LRR [8] seeks the lowest rank representation matrix.
When the data is noiseless, LRR solves the following optimization problem:

$\min \|Z\|_*, \quad \text{s.t.} \quad X = XZ,$ (2)

where $\|\cdot\|_*$ denotes the nuclear norm. It was proved in [8] that the solution set of (2) includes
at least one block diagonal representation matrix $Z^*$ that can be used for clustering.
When the data is noisy, the optimization model of LRR is formulated as:

$\min \|Z\|_* + \lambda \|E\|_{2,1}, \quad \text{s.t.} \quad X = XZ + E,$ (3)

where $\|E\|_{2,1} = \sum_{i=1}^{n} \sqrt{\sum_{j=1}^{m} (E_{ji})^2}$ is the (2,1)-norm. Minimizing the (2,1)-norm of the noise
reflects the assumption that the corruptions are "sample specific" [8], i.e., some data
vectors are corrupted and the others are clean. Since in this case the solution $Z^*$ to (3) may
not be block diagonal, it is regarded as an affinity matrix instead, and spectral clustering
methods are applied to $|Z^*| + |(Z^*)'|$ to obtain a block diagonal matrix, where $'$ denotes the
matrix or vector transpose and $|A|$ denotes a matrix whose entries are the absolute values of
the entries of $A$.
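To make the two quantities just defined concrete, the following is a minimal NumPy sketch of the (2,1)-norm and of the symmetrized affinity $|Z| + |Z'|$. The function names `l21_norm` and `affinity_from_representation` are illustrative choices, not names used in [8].

```python
import numpy as np

def l21_norm(E):
    """(2,1)-norm: sum over columns of the Euclidean norm of each column."""
    return float(np.sum(np.linalg.norm(E, axis=0)))

def affinity_from_representation(Z):
    """Symmetric, nonnegative affinity |Z| + |Z'| passed to spectral clustering."""
    return np.abs(Z) + np.abs(Z.T)

# Toy example: only the first and third columns are nonzero,
# so the (2,1)-norm is ||(3,4)|| + ||(0,0)|| + ||(0,1)|| = 5 + 0 + 1.
E = np.array([[3.0, 0.0, 0.0],
              [4.0, 0.0, 1.0]])
print(l21_norm(E))  # 6.0

Z = np.array([[0.5, -0.2],
              [0.1, 0.5]])
A = affinity_from_representation(Z)
```

Because the norm sums column lengths rather than squared entries, it favors solutions in which whole columns of $E$ vanish, which matches the "sample specific" corruption assumption.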

2.3 Factorization Methods and Shape Interaction Matrix

Factorization based methods build a similarity matrix by factorizing the data matrix and then
apply spectral clustering to the similarity matrix for clustering. This similarity matrix,
which is called the shape interaction matrix (SIM) [3] in computer vision, is defined as
$\mathrm{SIM}(X) = V_r V_r'$, where $X = U_r S_r V_r'$ is the skinny singular value decomposition (SVD) of
$X$ and $r$ is the rank of $X$. When the data is noiseless, we have [3]:

Theorem 2.1 (Costeira and Kanade) Under the assumption that the subspaces are
independent and the data $X$ is clean, $\mathrm{SIM}(X)$ is a block diagonal matrix that has exactly $k$ blocks.
Moreover, the i-th block on its diagonal is of size $d_i \times d_i$.



We can further show:

Theorem 2.2 The rank of the i-th diagonal block of $\mathrm{SIM}(X)$ is $r_i$.

Proof: We partition $V_r'$ as $V_r' = [V_{r,1}', V_{r,2}', \ldots, V_{r,k}']$, where the number of columns in $V_{r,i}'$ is
$d_i$. Then $V_{r,i} V_{r,i}'$ is the i-th diagonal block of $\mathrm{SIM}(X)$ and $X_i = U_r S_r V_{r,i}'$. We also
have $V_{r,i}' = S_r^{-1} U_r' X_i$. So $\mathrm{rank}(V_{r,i}') = \mathrm{rank}(X_i) = r_i$. The theorem is proved by using the
relationship $\mathrm{rank}(V_{r,i} V_{r,i}') = \mathrm{rank}(V_{r,i}')$.

□

Theorem 2.1 is the theoretical foundation of why SIM can serve as the similarity matrix
for subspace segmentation. When the data contains noise, the SIM of the data matrix can 
still be computed. Although in this case the SIM may not be block diagonal, it can be made 
block diagonal, e.g., by applying spectral clustering methods. 
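As a quick numerical illustration of Theorems 2.1 and 2.2, the sketch below draws clean data from two random subspaces (which are independent with probability one when their dimensions sum to less than the ambient dimension), computes the SIM from the skinny SVD, and inspects its block structure. The helper names and the toy sizes are assumptions made for this example only.

```python
import numpy as np

def shape_interaction_matrix(X, tol=1e-10):
    """Return SIM(X) = V_r V_r' computed from the skinny SVD X = U_r S_r V_r'."""
    _, s, Vt = np.linalg.svd(X, full_matrices=False)
    r = int(np.sum(s > tol * s[0]))  # numerical rank of X
    Vr = Vt[:r].T                    # n x r matrix of right singular vectors
    return Vr @ Vr.T

def sample_union_of_subspaces(m, dims, counts, rng):
    """Draw counts[i] points from a random dims[i]-dimensional subspace of R^m."""
    blocks = []
    for d, c in zip(dims, counts):
        basis = rng.standard_normal((m, d))
        blocks.append(basis @ rng.standard_normal((d, c)))
    return np.hstack(blocks)

rng = np.random.default_rng(0)
dims, counts = [2, 3], [6, 8]        # r_1 = 2, r_2 = 3, d_1 = 6, d_2 = 8
X = sample_union_of_subspaces(30, dims, counts, rng)
Z = shape_interaction_matrix(X)

# Largest entry outside the two diagonal blocks (zero up to round-off).
off_block_max = float(np.abs(Z[:6, 6:]).max())
# Ranks of the diagonal blocks, expected to equal the subspace dimensions.
block_ranks = (np.linalg.matrix_rank(Z[:6, :6], tol=1e-8),
               np.linalg.matrix_rank(Z[6:, 6:], tol=1e-8))
print(off_block_max, block_ranks)
```

With noisy data the off-block entries are no longer zero, which is why a spectral clustering step is applied to the SIM in that case.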

3 Relationship between LRR and Factorization Methods 

Though LRR was proposed as a sparsity based method, we show in this section that it is 
equivalent to the factorization methods. This is revealed by the following theorem: 

Theorem 3.1 The shape interaction matrix $\mathrm{SIM}(X)$ is the unique solution to the
optimization problem (2) and the minimum objective function value is $\mathrm{rank}(X)$.

Proof: Let $[U_X, S_X, V_X]$ and $[U_Z, S_Z, V_Z]$ be the full SVD of $X$ and $Z$, respectively. Denote
$M = V_X' V_Z$ and $N = V_X' U_Z$. They are both orthogonal matrices. Then $X = XZ$ is equivalent
to

$S_X M = S_X N S_Z.$ (4)

Suppose $X$ is of rank $r$ and $M$ and $N$ are partitioned as
$M = \begin{bmatrix} M_r \\ M_{n-r} \end{bmatrix}$ and $N = \begin{bmatrix} N_r \\ N_{n-r} \end{bmatrix}$,
respectively, where $M_r$ consists of the first $r$ rows of $M$, etc. Then (4)
reduces to

$M_r = N_r S_Z.$ (5)



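Now consider (2). Since

$N S_Z M' = \begin{bmatrix} N_r S_Z M_r' & N_r S_Z M_{n-r}' \\ N_{n-r} S_Z M_r' & N_{n-r} S_Z M_{n-r}' \end{bmatrix},$ (6)

by Lemma 3.1 in [8], we have

$\|N_r S_Z M_r'\|_* \le \|N S_Z M'\|_* - \|N_{n-r} S_Z M_{n-r}'\|_*.$

As $\|N_r S_Z M_r'\|_* = \|M_r M_r'\|_* = r$, where (5) is applied, and $\|N S_Z M'\|_* = \|S_Z\|_* = \|Z\|_*$,
the above inequality reduces to

$r \le \|Z\|_* - \|N_{n-r} S_Z M_{n-r}'\|_* \le \|Z\|_*.$ (7)

Noticing $\|\mathrm{SIM}(X)\|_* = r$ and $X = X \cdot \mathrm{SIM}(X)$, we conclude that the optimal $Z$ must
satisfy $\|Z\|_* = r$, i.e., the minimum objective function value is $\mathrm{rank}(X)$, and $\mathrm{SIM}(X)$ is
one of the optimal solutions.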

Next, we prove that $\mathrm{SIM}(X)$ is the unique solution to (2). Suppose that $Z_0$ is a solution
to (2). First, by (5), $N_r S_{Z_0} M_{n-r}' = M_r M_{n-r}' = 0$. Note that here and in the sequel, $N_r$ and
$M_{n-r}$ depend on $Z_0$. Second, since $\|Z_0\|_* = r$, from (7) we have $N_{n-r} S_{Z_0} M_{n-r}' = 0$. Thus
(6) reduces to

$N S_{Z_0} M' = \begin{bmatrix} I_r & 0 \\ N_{n-r} S_{Z_0} M_r' & 0 \end{bmatrix},$ (8)

which implies the rank of $Z_0$ must be $r$, i.e., $S_{Z_0}$ has only $r$ nonzero entries on its diagonal.
We then further partition $M_r$ and $N_r$ as $M_r = [M_{r,r}, M_{r,n-r}]$ and $N_r = [N_{r,r}, N_{r,n-r}]$,
respectively, where $M_{r,r}$ consists of the first $r$ columns of $M_r$, etc., and write
$S_{Z_0} = \mathrm{diag}\{z_1, z_2, \ldots, z_r, 0, \ldots, 0\}$,
where $z_1 \ge z_2 \ge \cdots \ge z_r > 0$. Then by comparing both sides of (5), we have $M_{r,n-r} = 0$.
So $M_{r,r}$ is an orthogonal matrix and $M_{r,r} = N_{r,r}\,\mathrm{diag}\{z_1, z_2, \ldots, z_r\}$. We further denote by $M_{r,r}^{r}$
and $N_{r,r}^{r}$ the last columns of $M_{r,r}$ and $N_{r,r}$, respectively. Then $M_{r,r}^{r} = z_r N_{r,r}^{r}$ and thus we
have

$z_r = \frac{\|M_{r,r}^{r}\|_2}{\|N_{r,r}^{r}\|_2} = \frac{1}{\|N_{r,r}^{r}\|_2} \ge 1,$ (9)

where $\|\cdot\|_2$ denotes the 2-norm of a vector. The last inequality holds because $N_{r,r}^{r}$ is part
of a column of the orthogonal matrix $N$. Since $z_r$ is the smallest nonzero singular value
of $Z_0$ and $\|Z_0\|_* = r$, (9) implies that all the nonzero singular values of $Z_0$ must be 1. So
we conclude that $M_{r,r} = N_{r,r}$, $S_{Z_0} = \mathrm{diag}\{I_r, 0_{n-r}\}$ and hence both $M$ and $N$ are block
diagonal. Finally,

$Z_0 = U_{Z_0} S_{Z_0} V_{Z_0}' = V_X N S_{Z_0} M' V_X' = V_X\,\mathrm{diag}\{I_r, 0_{n-r}\}\,V_X' = V_r V_r' = \mathrm{SIM}(X).$

□

As the uniqueness of the solution to LRR is guaranteed, we can always use LRR to cluster
clean data. Moreover, the fact that the solution is $\mathrm{SIM}(X)$ helps us understand the noisy LRR
model (3).
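Theorem 3.1 can be sanity-checked numerically. The sketch below, on a randomly generated rank-$r$ matrix (the sizes are arbitrary choices for illustration), verifies that $\mathrm{SIM}(X)$ is feasible for $X = XZ$ with nuclear norm $r$, and that two other feasible matrices have a strictly larger nuclear norm. It is a check on one instance, not a proof.

```python
import numpy as np

def nuclear_norm(A):
    """Sum of the singular values of A."""
    return float(np.linalg.svd(A, compute_uv=False).sum())

rng = np.random.default_rng(1)
m, n, r = 20, 12, 4
X = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))  # rank-r data

_, s, Vt = np.linalg.svd(X, full_matrices=False)
Vr = Vt[:r].T
sim = Vr @ Vr.T                                  # SIM(X) = V_r V_r'

feasible = np.allclose(X, X @ sim)               # X = X * SIM(X)
nn_sim = nuclear_norm(sim)                       # equals r up to round-off

# Two other feasible points of X = XZ:
#   Z = I, and Z = SIM(X) + (I - SIM(X)) W for an arbitrary W,
#   since X (I - SIM(X)) = 0.
W = rng.standard_normal((n, n))
Z_alt = sim + (np.eye(n) - sim) @ W
alt_feasible = np.allclose(X, X @ Z_alt)
nn_identity = nuclear_norm(np.eye(n))            # equals n
nn_alt = nuclear_norm(Z_alt)

print(feasible, alt_feasible, nn_sim, nn_identity, nn_alt)
```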

Understanding the Noisy LRR Model 

Now consider the noisy case. It was not completely clear why solving (3) is effective in
removing the column sparse noise. Note that LRR uses the noisy data $X$ itself as the
dictionary instead of the clean data $D$, which is not quite reasonable when the noise is heavy or
the percentage of outliers is relatively large. If we use the clean data as the dictionary, the
noisy LRR model (3) would change to:

$\min_{Z,D,E} \|Z\|_* + \lambda \|E\|_{2,1}, \quad \text{s.t.} \quad D = DZ, \; X = D + E.$ (10)

By Theorem 3.1, it is straightforward to see that (10) is equivalent to

$\min_{D,E} \mathrm{rank}(D) + \lambda \|E\|_{2,1}, \quad \text{s.t.} \quad X = D + E,$ (11)

and $Z = \mathrm{SIM}(D)$. Model (11) is very similar to the robust PCA model [9]. It would
decompose the data into two parts: one is of low rank and the other is column sparse. So
we call problem (11) column sparse robust PCA (CSRPCA). Denote $S = \oplus_{i=1}^{k} S_i$. Then
the noise in the data can be decomposed into two parts: $E = E_S + E_{S^\perp}$, where each column of
$E_S$ belongs to the space $S$ and each column of $E_{S^\perp}$ is orthogonal to $S$. As rank minimization
methods remove noise outside the span of the clean data, CSRPCA is effective in
removing $E_{S^\perp}$ but is unable to remove $E_S$. However, in many real problems, the dimension
of $S$ (i.e., the rank $r$ of the clean data) is much smaller than the dimension of the data. So with high
probability, $\|E_S\|_F$ is much smaller than $\|E_{S^\perp}\|_F$, where $\|\cdot\|_F$ denotes the Frobenius norm.
Consequently, CSRPCA is able to remove most of the noise.

So we can see that the LRR model for noisy data is an approximation of (10). Solving
(10) is equivalent to first applying CSRPCA to remove the noise that is orthogonal to the
space spanned by the clean data, then computing the SIM to build the affinity matrix. Since
the noise level is greatly reduced, spectral clustering methods on such an affinity matrix can
be very effective in segmenting the data into subspaces. This explains why LRR can have good
performance on noisy data.
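The claim that the in-span part of the noise is small relative to the out-of-span part can be illustrated with a short simulation. The sketch below assumes isotropic Gaussian noise and a random subspace $S$ of dimension 20 in a 100-dimensional space; these settings, and the variable names, are assumptions made for the illustration. Under that assumption the ratio of squared Frobenius norms is expected to be close to $r / (m - r)$.

```python
import numpy as np

rng = np.random.default_rng(2)
m, r, n = 100, 20, 50   # ambient dimension, dim(S), number of noise columns

# Orthonormal basis of a random r-dimensional subspace S of R^m.
Q, _ = np.linalg.qr(rng.standard_normal((m, r)))

E = rng.standard_normal((m, n))   # isotropic Gaussian noise (an assumption)
E_S = Q @ (Q.T @ E)               # component of each column inside S
E_perp = E - E_S                  # component of each column orthogonal to S

ratio = np.linalg.norm(E_S, "fro") ** 2 / np.linalg.norm(E_perp, "fro") ** 2
expected = r / (m - r)            # ratio of the two subspace dimensions
print(ratio, expected)
```

The smaller $r$ is relative to $m$, the smaller the fraction of noise energy that CSRPCA cannot remove.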

4 RSI: an Improved Version of LRR 

That $X$ itself can be used as the dictionary is based on the assumption that the percentage
of outliers is small and the noise level is low. When this is not true, a more reasonable
formulation is (10), or equivalently (11). Following the convention, we may replace the
rank function in (11) with the nuclear norm, giving rise to the following convex optimization
problem:

$\min_{D,E} \|D\|_* + \lambda \|E\|_{2,1}, \quad \text{s.t.} \quad X = D + E.$ (12)

Similar to robust PCA [9], the above problem can be solved by the inexact augmented
Lagrange multiplier algorithm [10], which is based on the following Lagrangian function:

$L(D, E, Y) = \|D\|_* + \lambda \|E\|_{2,1} + \langle Y, X - D - E \rangle + \frac{\mu}{2} \|X - D - E\|_F^2,$

where $Y$ is the Lagrange multiplier, $\mu$ is a positive penalty parameter, and $\langle A, B \rangle = \mathrm{tr}(A'B)$
is the inner product of matrices.
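As a short worked step, the inner product term and the penalty term of an augmented Lagrangian of this form can be merged by completing the square. This is a standard identity, written here with $\lambda$ for the weight of the (2,1)-norm and $\mu$ for the penalty parameter:

```latex
\begin{aligned}
L(D, E, Y)
  &= \|D\|_* + \lambda \|E\|_{2,1} + \langle Y, X - D - E \rangle
     + \frac{\mu}{2} \|X - D - E\|_F^2 \\
  &= \|D\|_* + \lambda \|E\|_{2,1}
     + \frac{\mu}{2} \left\| X - D - E + \mu^{-1} Y \right\|_F^2
     - \frac{1}{2\mu} \|Y\|_F^2 .
\end{aligned}
```

With $E$ and $Y$ fixed, minimizing over $D$ is therefore a singular value thresholding problem with threshold $\mu^{-1}$, and with $D$ and $Y$ fixed, minimizing over $E$ is a column-wise shrinkage with threshold $\lambda \mu^{-1}$.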

When the "clean" data $D$ is obtained, its representation matrix can be obtained as $Z =
\mathrm{SIM}(D)$. As $D$ still contains the noise $E_S$, $Z$ may not be a block diagonal matrix. Like LRR
and SSC [7], spectral clustering is also applied to $|Z|$ to reveal the subspace clustering
information. Note that as $\mathrm{SIM}(D)$ is symmetric, we do not have to use $|Z| + |Z'|$ as the
affinity matrix. We call this method Robust Shape Interaction (RSI). The pseudo-code for
RSI is presented in Algorithm 1, where $\Theta_\varepsilon(x) = \max(|x| - \varepsilon, 0)\,\mathrm{sgn}(x)$ is the thresholding
operator. Readers are encouraged to refer to [10] and [8] for the deduction of the formulae
in Algorithm 1.

Algorithm 1 Robust Shape Interaction

Input: data matrix $X$, parameter $\lambda > 0$.

Initialize: $E_0$, $Y_0$, $\mu_0 > 0$, $\mu_{\max} > \mu_0$, $\rho > 1$, $\varepsilon > 0$.

while $\|X - D_k - E_k\|_\infty > \varepsilon$ do

1. Update $D$ by solving
   $D_{k+1} = \arg\min_D L(D, E_k, Y_k) = \arg\min_D \|D\|_* + \frac{\mu_k}{2} \|X - D - E_k + \mu_k^{-1} Y_k\|_F^2$:
   $(U, S, V) = \mathrm{svd}(X - E_k + \mu_k^{-1} Y_k)$, $D_{k+1} = U\,\Theta_{\mu_k^{-1}}[S]\,V'$.

2. Update $E$ by solving
   $E_{k+1} = \arg\min_E L(D_{k+1}, E, Y_k) = \arg\min_E \lambda \|E\|_{2,1} + \frac{\mu_k}{2} \|X - D_{k+1} - E + \mu_k^{-1} Y_k\|_F^2$:
   Suppose the i-th column of $X - D_{k+1} + \mu_k^{-1} Y_k$ is $q_i$, then the i-th column of $E_{k+1}$ is
   $\Theta_{\lambda \mu_k^{-1}}(\|q_i\|_2)\, q_i / \|q_i\|_2$.

3. Update $Y$ by: $Y_{k+1} = Y_k + \mu_k (X - D_{k+1} - E_{k+1})$.

4. Update $\mu$ by: $\mu_{k+1} = \min(\rho \mu_k, \mu_{\max})$.

5. $k \leftarrow k + 1$.

end while

Compute $Z = \mathrm{SIM}(D)$.

Perform spectral clustering on $|Z|$.

Output: The subspace clusters indicated by the blocks of the processed $|Z|$.

Although RSI does not make a big change to LRR, as CSRPCA removes most of the
noise and RSI uses relatively clean data as the dictionary, RSI is more robust than LRR,
particularly when the data is heavily corrupted.
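The procedure described in this section can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions rather than the authors' implementation: the function names, the starting value of $\mu$, the growth factor, the stopping tolerance, and the iteration cap are all choices made for this example, and the final spectral clustering step is left to any standard routine that accepts a precomputed affinity matrix.

```python
import numpy as np

def column_shrink(Q, tau):
    """Scale each column q of Q by max(||q||_2 - tau, 0) / ||q||_2."""
    norms = np.linalg.norm(Q, axis=0)
    scale = np.maximum(norms - tau, 0.0) / np.maximum(norms, 1e-12)
    return Q * scale

def csrpca(X, lam, mu=None, mu_max=1e6, rho=1.5, eps=1e-7, max_iter=500):
    """Inexact ALM sketch for min ||D||_* + lam * ||E||_{2,1} s.t. X = D + E."""
    if mu is None:
        mu = 1.25 / np.linalg.norm(X, 2)  # heuristic start value (an assumption)
    D = np.zeros_like(X)
    E = np.zeros_like(X)
    Y = np.zeros_like(X)
    for _ in range(max_iter):
        # Step 1: singular value thresholding with threshold 1/mu.
        U, s, Vt = np.linalg.svd(X - E + Y / mu, full_matrices=False)
        D = (U * np.maximum(s - 1.0 / mu, 0.0)) @ Vt
        # Step 2: column-wise shrinkage with threshold lam/mu.
        E = column_shrink(X - D + Y / mu, lam / mu)
        # Steps 3 and 4: multiplier and penalty parameter updates.
        R = X - D - E
        Y = Y + mu * R
        mu = min(rho * mu, mu_max)
        if np.max(np.abs(R)) <= eps:
            break
    return D, E

def rsi_affinity(X, lam, rank_tol=1e-6):
    """Return |SIM(D)| for the corrected data D, together with D and E."""
    D, E = csrpca(X, lam)
    _, s, Vt = np.linalg.svd(D, full_matrices=False)
    r = int(np.sum(s > rank_tol * max(s[0], 1e-12)))  # numerical rank of D
    Vr = Vt[:r].T
    return np.abs(Vr @ Vr.T), D, E

# Toy usage: low rank data with one heavily corrupted column.
rng = np.random.default_rng(0)
X_demo = rng.standard_normal((30, 3)) @ rng.standard_normal((3, 20))
X_demo[:, 0] += 5.0 * rng.standard_normal(30)
A_demo, D_demo, E_demo = rsi_affinity(X_demo, lam=0.5)
```

The affinity `A_demo` is symmetric by construction, so it can be passed directly to a spectral clustering routine without the $|Z| + |Z'|$ symmetrization used for LRR.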

5 Experimental Results 
Clustering Synthetic Data 

We first compare the robustness of LRR and RSI on synthetic data. We construct 5
independent subspaces, each having a dimension of 4, and sample 20 data vectors with dimension
100 from each subspace. For each data point $p$, small Gaussian noise of variance $0.1\,\|p\|_2$
is added. Moreover, we randomly choose a certain percentage of points as outliers. For
each outlier point $p_{outlier}$, we add large Gaussian noise of variance $\|p_{outlier}\|_2$ to it. Then we
test LRR and RSI on this corrupted data. The parameters are chosen as $\lambda_{LRR} = 0.12$ and
$\lambda_{RSI} = 0.6$, respectively, both of which are optimal for achieving the highest segmentation
accuracy. For each percentage of outliers, we repeat the experiment 20 times and record the
average accuracy and standard deviation. As shown in Figure 1, RSI has better performance
than LRR as the percentage of outliers increases, and its performance is very stable. In
this experiment, the maximum standard deviation of RSI is 0.0403 and the average standard
deviation is 0.0306. Thus RSI is more robust than LRR on this synthetic data.
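A data set of the kind described above could be generated along the following lines. This is a sketch under assumptions: the function name `make_synthetic` is hypothetical, the noise is taken to be i.i.d. per entry with the stated variance, the outlier noise is added on top of the small noise, and the norms are those of the clean points. The text does not pin down these details.

```python
import numpy as np

def make_synthetic(n_subspaces=5, dim=4, per_subspace=20, ambient=100,
                   outlier_fraction=0.2, seed=0):
    """Sample noisy points from random subspaces and corrupt a subset heavily."""
    rng = np.random.default_rng(seed)
    blocks, labels = [], []
    for k in range(n_subspaces):
        # Random subspaces are independent with probability one here,
        # since n_subspaces * dim is well below the ambient dimension.
        basis, _ = np.linalg.qr(rng.standard_normal((ambient, dim)))
        blocks.append(basis @ rng.standard_normal((dim, per_subspace)))
        labels += [k] * per_subspace
    X = np.hstack(blocks)
    labels = np.array(labels)
    n = X.shape[1]
    norms = np.linalg.norm(X, axis=0)  # ||p||_2 for each clean point p

    # Small noise on every point: per-entry variance 0.1 * ||p||_2.
    X_noisy = X + rng.standard_normal(X.shape) * np.sqrt(0.1 * norms)

    # Large noise on the outliers: per-entry variance ||p||_2.
    n_out = int(round(outlier_fraction * n))
    outliers = rng.choice(n, size=n_out, replace=False)
    X_noisy[:, outliers] += (rng.standard_normal((ambient, n_out))
                             * np.sqrt(norms[outliers]))
    return X_noisy, labels, outliers

X_syn, y_syn, out_syn = make_synthetic(outlier_fraction=0.2, seed=0)
```

Sweeping `outlier_fraction` and repeating with different seeds would reproduce the structure of the experiment, with the segmentation accuracy then measured against `y_syn`.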



[Figure 1 plot: segmentation accuracy of LRR and RSI (vertical axis) against the percentage of corruption (horizontal axis, 0 to 50).]

Figure 1: Segmentation accuracy of LRR and RSI. The parameters are set as $\lambda_{LRR} = 0.12$
and $\lambda_{RSI} = 0.6$, respectively.



Clustering Real Data 

We then test on the Hopkins 155 motion database [11] in this experiment. The database contains
156 video sequences and each of them is a clustering task. For each sequence, there are
39 to 550 data vectors belonging to two or three motions, each motion corresponding to a
subspace. We first repeat the same experiment, including the preprocessing, as done in LRR
([8], Section 4.2) and then compare it with RSI. The results are shown in Table 1. Note that
after preprocessing, the data only contains slight corruptions. So both RSI and LRR perform
well. Evaluated by the average performance, RSI outperforms LRR on this slightly corrupted
dataset.

METHOD    MEAN      MEDIAN      STD
LRR       4.3673    0.4717      7.4540
RSI       2.8501    [missing]   7.5858

Table 1: Segmentation error rate (%) on the Hopkins155 database. The parameters are
selected as $\lambda_{LRR} = 2.4$ and $\lambda_{RSI} = 0.24$, respectively.

We finally test with the Extended Yale Database B [12], which consists of 640 frontal
face images of 10 subjects (there are 38 subjects in the whole database and we use the first
10 subjects for our experiment). Each subject contains about 64 images. The corruptions
in this database are heavy, as more than half of the face images contain shadows or specular
lights. As was done for LRR, we resize the images to 48 x 42 pixels and use the raw pixel values
to form data vectors of dimension 2016. The parameters are chosen as $\lambda_{LRR} = 0.05$ and
$\lambda_{RSI} = 0.6$ for LRR and RSI, respectively, both of which are optimal for achieving the lowest
segmentation error, based on our reimplemented code. The clustering accuracies of LRR
and RSI are 55.31% and 58.13%, respectively. So RSI also performs better than LRR on this
heavily corrupted dataset.

Figure 2: Two examples of using LRR and RSI to correct corruptions in face images in
the Extended Yale Database B. The parameters are set as $\lambda_{LRR} = 0.05$ and $\lambda_{RSI} = 0.6$,
respectively. The first row shows two original faces in the database. The second and third
rows show the denoising results of LRR and RSI, respectively. For each face, the corrected
image is displayed on the left, and the noise image is displayed on the right.

Both LRR and RSI can be used to remove noise. Figure 2 shows two examples. From the
figure, it is clear that RSI is able to remove noise as well as LRR.

6 Conclusions

We have analyzed LRR and revealed its close relationship with the factorization methods. 
In particular, we prove that when the data is clean, the LRR problem has a unique solution, 
which is exactly the shape interaction matrix of the data matrix, and the minimum objective 
function value is the rank of the data matrix. We further propose the Robust Shape Interaction 
method, which first removes the noise by column sparse robust PCA, and then computes the 
SIM of the corrected data for subspace segmentation. RSI is verified by experiments to be 
more robust than LRR. 

References 

[1] Y. Ma, H. Derksen, W. Hong, and J. Wright, "Segmentation of multivariate mixed data 
via lossy data coding and compression," IEEE Transactions on Pattern Analysis and 
Machine Intelligence, pp. 1546-1562, 2007. 

[2] A. Yang, S. Rao, and Y. Ma, "Robust statistical estimation and segmentation of multiple
subspaces," in Computer Vision and Pattern Recognition Workshop on 25 Years of
RANSAC, 2006, pp. 99-107.

[3] J. Costeira and T. Kanade, "A multibody factorization method for independently
moving objects," International Journal of Computer Vision, vol. 29, no. 3, pp. 159-179,
1998.

[4] A. Gruber and Y. Weiss, "Multibody factorization with uncertainty and missing data
using the EM algorithm," in IEEE Conference on Computer Vision and Pattern
Recognition, vol. 1, 2004, pp. 707-714.

[5] R. Vidal, R. Tron, and R. Hartley, "Multiframe motion segmentation with missing
data using powerfactorization and GPCA," International Journal of Computer Vision,
vol. 79, no. 1, pp. 85-105, 2008.

[6] Y. Ma, A. Yang, H. Derksen, and R. Fossum, "Estimation of subspace arrangements
with applications in modeling and segmenting mixed data," SIAM Review, vol. 50,
no. 3, pp. 413-458, 2008.

[7] E. Elhamifar and R. Vidal, "Sparse subspace clustering," in IEEE Conference on
Computer Vision and Pattern Recognition, vol. 2, 2009, pp. 2790-2797.

[8] G. Liu, Z. Lin, and Y. Yu, "Robust subspace segmentation by low-rank representation,"
in International Conference on Machine Learning, 2010.

[9] J. Wright, A. Ganesh, S. Rao, and Y. Ma, "Robust principal component analysis: Exact
recovery of corrupted low-rank matrices via convex optimization," submitted to Journal
of the ACM, 2009.

[10] Z. Lin, M. Chen, L. Wu, and Y. Ma, "The augmented Lagrange multiplier method for
exact recovery of corrupted low-rank matrix," submitted to Mathematical
Programming, 2009.

[11] R. Tron and R. Vidal, "A benchmark for the comparison of 3-D motion segmentation
algorithms," in IEEE Conference on Computer Vision and Pattern Recognition, 2007,
pp. 1-8.

[12] K. Lee, J. Ho, and D. Kriegman, "Acquiring linear subspaces for face recognition under
variable lighting," IEEE Transactions on Pattern Analysis and Machine Intelligence,
pp. 684-698, 2005.